Statistical modelling in R for ecologists (Part 1)
Whenever you embark on a journey in statistical modelling, it’s good to be reminded of the following quote from George E.P. Box,
‘All models are wrong, but some are useful.’
We build models to help us understand ecological relationships. But no model is perfect. Ecological data in particular is fraught with a lot of random, natural variation in observations that our models will be unable to explain.
Modelling goals can be three-fold:
- Exploration, e.g. which environmental variables are associated with species richness?,
- Inference, e.g., do protected areas cause increased species richness?,
- Prediction, e.g., use correlations between environmental variables and species abundance to predict species habitat suitability across space.
How we build and select models depends on our goals; see this guide to goal-driven model selection in ecology.
Here, we will focus on learning the basics of fitting models to answer ecological questions. You will be able to use these modelling skills regardless of whether your goal is exploration, inference, or prediction (model development and selection for different goals is beyond the scope of this course).
We’ll start with a basic linear model to explore how fire severity influences ecological characteristics at survey locations. We’ll see how, depending on the type of data we are modelling, we might need to use generalised linear models (as opposed to general linear models). In part 2 of this course, we’ll build more complex multivariate models in R.
This course assumes an understanding of how to wrangle, summarise, and visualise data in R. We hope you will learn:
- The fundamentals of fitting statistical models to ecological data, and
- How to check that your model fits the data well.
General linear models
We can use linear models to estimate the strength and direction of ecological relationships. The word ‘linear’ is important - when we use linear models we assume a straight (or linear) relationship between our response variable (also referred to as the dependent variable) and our explanatory variables (also referred to as covariates, predictors, or independent variables - don’t let the different terms confuse you).
In addition to linearity, there are three other assumptions: homoskedasticity (constant variance) and normality of the residuals, and independence of observations. Don’t worry about those for now though, we’ll discuss them a bit later.
Explanatory variables can be either continuous (left plot below) or categorical (right plot below).
If you took a first year undergrad stats course, you might be thinking, let’s use a t-test or ANOVA for the categorical example above! It’s a common misunderstanding, but a general linear model with a categorical explanatory variable is essentially the same thing; see this explainer.
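We can demonstrate this equivalence with a quick sketch (made-up data, just for illustration): a pooled two-sample t-test and a linear model with a categorical predictor give identical p-values.

```r
# Made-up data: two groups with different means
set.seed(42)
group <- rep(c('A', 'B'), each = 50)
y <- rnorm(100, mean = ifelse(group == 'A', 5, 4), sd = 1)

# var.equal = TRUE gives the classic pooled t-test, matching the
# equal-variance assumption of the linear model
t.test(y ~ group, var.equal = TRUE)$p.value        # pooled t-test p-value
summary(lm(y ~ group))$coefficients[2, 'Pr(>|t|)'] # identical p-value
```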
One other important thing to note: although general linear models assume a linear relationship between the response and explanatory variable, they can accommodate non-linearity via a non-linear transformation of the explanatory variable - for example, a quadratic relationship between the explanatory variable and the response.
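As a minimal sketch (with simulated, hypothetical numbers), we can fit a hump-shaped relationship simply by adding a squared term - the model is still linear in its coefficients:

```r
# Simulate a quadratic (hump-shaped) relationship, then recover it
# by including a squared transformation of the explanatory variable
set.seed(42)
x <- runif(100, 0, 1)
y <- 2 + 3*x - 4*x^2 + rnorm(100, 0, 0.1)
coef(lm(y ~ x + I(x^2))) # estimates should be close to 2, 3, and -4
```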
A little bit of maths
To understand how they work, it helps to take a quick look at how we formulate linear models mathematically:
\[y_i = b_0 + b_{1}X_{1i} + \epsilon_i,\]
\(y_i\) is our response variable of interest,
\(b_0\) is the y-intercept for the linear relationship between \(y\) and our explanatory variables (\(X\)’s),
\(b_1\) is the coefficient representing the strength and direction of the relationship between \(y\) and \(X\) (the slope), and
\(\epsilon\) is the random, unexplained error in our observations around the fitted relationship(s) between \(y\) and \(X\)(s).
We’ll often refer to \(\epsilon\) as the residual error.
In the above example we only have one explanatory variable (\(X\)), but we can have multiple (referred to as ‘multiple linear regression’). We would just add on a new term to our equation for every additional explanatory variable, e.g. \(b_{2}X_{2i}\).
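As a sketch (with two simulated, hypothetical explanatory variables), multiple linear regression in R just adds terms to the formula, and each gets its own beta coefficient:

```r
# Simulate a response driven by two explanatory variables
set.seed(42)
n <- 100
x1 <- runif(n) # e.g. fire severity
x2 <- runif(n) # e.g. a hypothetical second variable like rainfall
y <- 5 - 1*x1 + 0.5*x2 + rnorm(n, 0, 0.1)

coef(lm(y ~ x1 + x2)) # estimates close to the simulated 5, -1, and 0.5
```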
When we’re fitting general linear models, we assume that \(\epsilon\), the error term, is drawn from a normal (i.e., gaussian) distribution. We can write that assumption mathematically as,
\[\epsilon \sim Normal(0, \sigma)\] We would expect this if our response variable (\(y\)) is continuous and unbounded (could potentially be any number between \(-∞\) and \(∞\)). (Note: In the next section, we’ll learn how to fit generalised linear models which allow us to model non-normal error distributions necessary for count data, for example.)
Simulating data to enhance our understanding
When fitting models to data, one of the most important tools at our disposal is data simulation. It’s a very important, but often under-utilised skill. We can use R to simulate data that corresponds to what we expect to observe in the real world, given what we know about the data generating process.
Simulated data has a known truth. We know exactly what the relationship between \(x\) and \(y\) is because we define it when we do the simulation (we’ll do this in a sec).
Knowing the truth is very powerful when developing statistical models. We can directly test whether our model can reliably recover the known truth. If it can, then we can be reasonably confident that it will perform well on data collected in the real world (given our assumptions regarding the data generating process are correct).
Simulation is also a great way to enhance our understanding of how models work. So let’s try it.
First we will simulate data to recreate figures A) and B) above. Then we will fit linear models to the simulated data and see if they can reliably estimate the known truth.
Continuous explanatory variable
As an example, we’ll consider our response variable of interest as tree circumference, and our explanatory variable to be fire severity.
We will simulate a dataset of 100 tree circumference measurements and fire severity in the area surrounding each measurement. To do so, we’ll make some assumptions about the data generating process based on our ecological understanding of trees and fire. (DISCLAIMER: I am not an expert in forest ecology, and so all of the below assumptions are just for illustrative purposes, and could very much be wrong).
We’ll assume that:
- fire causes tree death and succession, resulting in smaller average tree circumference in locations with higher fire severity,
- the relationship between fire severity and tree circumference is negative and linear, and
- the natural, unexplained variation (error) in our measurements is normally distributed with a mean of 0 and a standard deviation of 0.1.
The latter assumption of normally distributed errors is a safe one, given that tree circumference is a continuous variable. As we’ll see later, if the response is discrete (like species abundance), we’ll have to assume a non-normal error distribution.
Given the above, we think that we can use a general linear model to estimate the size of the effect (i.e., the slope of the relationship) between tree circumference (\(y\)) and fire severity (\(x\)).
So, we will use the equation for a linear model to simulate 100 tree circumference measurements (\(y\)):
\[y_i = b_0 + b_{1}X_{1i} + \epsilon_i.\]
We’ll assume that, on average, the magnitude of the negative effect of fire severity on tree circumference is -1, meaning a one unit increase in fire severity results in a 1 unit decrease in tree circumference. This is the \(b_{1}\) coefficient (the slope) in the above equation. When fitting models to real data, this is often the relationship that is unknown and we are trying to estimate.
We’ll also assume that, on average, in locations without fire (i.e., fire severity = 0), tree circumference is equal to 5 (on average). This is the \(b_0\) coefficient (the y-intercept) in the above equation and is usually unknown.
Fire severity (\(X\)) will be measured on a continuous scale between 0 to 1, where 0 = no fire and 1 = highest level of fire severity.
As mentioned in our assumptions above, we expect natural variation in tree circumference measurements (i.e., variation not due to fire severity, \(\epsilon\)) to be normally distributed with a mean of 0 and standard deviation of 0.1.
Simulating data
First let’s set up our simulation parameters - they are constants in the linear model.
b0 <- 5 # average tree circumference when fire severity is = 0 (i.e, y-intercept)
b1 <- -1 # beta coefficient representing association (slope) between tree circumference and fire severity
e_mu <- 0 # mean error
e_sd <- 0.1 # error standard deviation
Great, now all we need to simulate tree circumference is fire severity measurements (\(X\)) and error (\(\epsilon\)). We need 100 of each (n = 100). We’ll draw fire severity measurements randomly from a uniform distribution with a minimum value of 0 and a max of 1.
n <- 100 # number of tree circumference measurements
X <- runif(n, min = 0, max = 1) # fire severity measurements
e <- rnorm(n, e_mu, e_sd) # natural error
Now we can simulate tree circumference using our linear equation: \(y_i = b_0 + b_{1}X_{1i} + \epsilon_i\).
y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
head(df)
  Fire_severity Tree_circumference
1 0.12582265 4.781367
2 0.24228965 4.844527
3 0.42392394 4.703584
4 0.04750893 4.732653
5 0.81288162 4.316959
6 0.83146610 4.026111
Looks good. Now let’s plot it.
library(ggplot2)
ggplot(df) +
aes(x = Fire_severity, y = Tree_circumference ) +
geom_point() +
ylab('Tree circumference') +
xlab('Fire severity') +
geom_smooth(method = 'lm') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Tip Do your numbers look slightly different from mine? To make our results reproducible, whenever we use a random number generator in R, we should set the seed.
set.seed(123) # set the seed for the random number generator
X <- runif(n, min = 0, max = 1) # fire severity measurements
e <- rnorm(n, e_mu, e_sd) # natural error
y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
# plot it
ggplot(df) +
aes(x = Fire_severity, y = Tree_circumference ) +
geom_point() +
ylab('Tree circumference') +
xlab('Fire severity') +
geom_smooth(method = 'lm') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Fitting linear models to simulated data to recover known truths
Great, we have successfully simulated tree circumference observations! The relationship looks as we expected, with tree circumference decreasing with increasing fire severity (plus/minus some natural, random variation that is unexplained by fire severity).
Now we want to know - can we accurately estimate the known truths, i.e., the y-intercept \(b_0\) and the slope \(b_{1}\), from our simulated data with a linear model?
m <- lm(Tree_circumference ~ Fire_severity, data = df)
m
Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)
Coefficients:
(Intercept) Fire_severity
4.999 -1.009
Pretty close to our known truths. What if we had fewer observations, though, or more? What if the standard deviation of the error \(\epsilon\) were higher or lower? Try changing n and e_sd to see the consequences.
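For example, here is one quick variation to try (the values of n and e_sd below are just illustrative choices):

```r
# Fewer observations and noisier errors - the estimates tend to drift
# further from the known truths of 5 and -1
set.seed(123)
n <- 20
e_sd <- 0.5
X <- runif(n, min = 0, max = 1) # fire severity measurements
e <- rnorm(n, 0, e_sd)          # noisier natural error
y <- 5 - 1*X + e                # same known truths as before
coef(lm(y ~ X))
```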
How are the beta coefficients of the model (the intercept \(b_0\) and the slope \(b_{1}\)) estimated? For a simple general linear model like this, R uses Ordinary Least Squares (OLS) to estimate the model coefficients. If you’re a visual learner like me, check out this great tool to see how OLS works. In a nutshell, we’re iteratively rotating the straight line between our \(y\) and \(X\) variables until we get the best fit (i.e. have minimised the sum of squared errors).
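In fact, for simple linear regression OLS has a closed-form solution, which we can compute by hand (a sketch, using the df and m objects from above) to confirm it matches what lm() returns:

```r
# OLS closed form for one explanatory variable:
# slope = cov(X, y) / var(X); intercept = mean(y) - slope * mean(X)
b1_hat <- cov(df$Fire_severity, df$Tree_circumference) /
  var(df$Fire_severity)
b0_hat <- mean(df$Tree_circumference) - b1_hat * mean(df$Fire_severity)
c(b0_hat, b1_hat) # matches coef(m)
```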
Checking the model fit
summary(m)
Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.223797 -0.061323 -0.001973 0.059633 0.221723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99910 0.01961 254.99 <2e-16 ***
Fire_severity -1.00898 0.03418 -29.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09693 on 98 degrees of freedom
Multiple R-squared: 0.8989, Adjusted R-squared: 0.8979
F-statistic: 871.3 on 1 and 98 DF, p-value: < 2.2e-16
Let’s unpack this model summary.
First up, summary statistics for the residuals. Residuals are the differences between the observed tree circumferences and the predictions of tree circumference made by our model. So a residual value of 0 means the model perfectly predicted tree circumference for a particular level of fire severity, while negative or positive residuals mean the model is over- or under-predicting tree circumference, respectively, for some levels of fire severity.
The residual summary statistics give us an indication of whether the residuals are normally distributed and if, on average, the model predictions are close to observed values (the min/max and quantiles are approximately the same and the median is close to 0). Ours look great!
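Using the df and m objects fitted above, we can verify this definition of a residual directly:

```r
# Residuals are simply observed minus fitted (predicted) values
r <- df$Tree_circumference - fitted(m)
all.equal(unname(r), unname(residuals(m))) # TRUE
```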
Then we have the beta coefficients (y-intercept and slope) estimated by the model, along with the standard error of the estimates and a p-value indicating whether they are likely to be different from 0 (the lower the p-value, the less likely they are to be equal to 0). Since we are simulating data, we would expect these to be the same as our known truths for \(b_0\) and \(b_1\), which they are!
b0
[1] 5
b1
[1] -1
The residual standard error measures how variable the residuals are, and the degrees of freedom is the number of observations (in our case 100) minus the number of parameters the model is estimating (in our case 2, because we are estimating 2 beta coefficients, and coefficients are parameters).
The R2 is a measure of model fit. It tells us our model has explained a whopping 89% of the variability in the data (note: use the adjusted R2 if you have more than one explanatory variable).
That’s a lot - the model is doing well! But remember this is not real data (we simulated it) - in ecology we usually see much lower R2 values due to the natural sources of variability that we can’t explain.
Tip Try increasing the standard deviation of the natural variation we’ve simulated (e_sd) and see how that changes the R2.
The F-statistic and corresponding p-value tell us whether our model is any better than an ‘intercept-only’ model (the lower the p-value, the more likely our model is explaining the data better than the intercept-only model).
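We can see this comparison explicitly by fitting the intercept-only model ourselves and comparing the two nested models with anova() (a sketch using the df and m objects from above):

```r
# Fit the 'intercept-only' (null) model and compare it to our model;
# anova() on nested models reproduces the overall F-statistic and p-value
m0 <- lm(Tree_circumference ~ 1, data = df)
anova(m0, m)
```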
In addition to how much variability in the data our model is explaining, we also want to make sure the model isn’t violating any structural assumptions, namely that the model residuals:
- are normally distributed, and
- have homogeneous variance.
The residual summary statistics already gave us an indication that they are normally distributed. But we should also use model diagnostic plots to check these assumptions:
par(mfrow = c(2,2))
plot(m)
These diagnostic plots show that our fitted model does not violate any structural model assumptions. Specifically, the residuals vs. fitted plot (top left) indicates homogeneity of variance because we see stars in the night sky (no strong patterns). Also, the Q-Q plot (top right) shows a 1:1 relationship between the theoretical quantiles of a normal distribution and the residuals, indicating they are normally distributed.
Let’s look at what happens if we try to fit a model that doesn’t fit the data well. Let’s simulate data with a non-linear (cubic) relationship between fire severity and tree circumference and try to fit a linear model to it.
y <- b0 + b1*X^3 + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
# plot it
ggplot(df) +
aes(x = Fire_severity, y = Tree_circumference ) +
geom_point() +
ylab('Tree circumference') +
xlab('Fire severity') +
geom_smooth(method = 'lm', col = 'red') +
geom_smooth(method = 'loess', col = 'blue') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
We can already see that assuming a linear relationship between fire severity and tree circumference (the red line) is unlikely to yield a model that fits the data well. But let’s try fitting the linear model anyway and see what the model diagnostics say about the model fit.
m <- lm(Tree_circumference ~ Fire_severity, data = df)
summary(m)
Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.40736 -0.08783 0.01631 0.09659 0.32965
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.20466 0.02820 184.56 <2e-16 ***
Fire_severity -0.91264 0.04917 -18.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1394 on 98 degrees of freedom
Multiple R-squared: 0.7785, Adjusted R-squared: 0.7763
F-statistic: 344.5 on 1 and 98 DF, p-value: < 2.2e-16
Already we see the explanatory power of the model is lower than before (R2 of 78%). Note also that our estimates of the model coefficients are slightly biased compared to before.
Now let’s check the diagnostics.
par(mfrow = c(2,2))
plot(m)
The Q-Q plot looks ok, but what about the residuals vs. fitted? The residuals clearly do not exhibit homogeneity of variance. In fact, the u-shaped pattern in the residuals vs. fitted plot gives us a clue that the model is not fitting the data well because there is a non-linear relationship between the response and explanatory variable.
What can we do? We can try transforming the explanatory variable the same way we did when simulating the data to produce the non-linear relationship. If we transform the explanatory variable, the linear model still assumes a linear relationship between the response and the transformed explanatory variable.
m <- lm(Tree_circumference ~ I(Fire_severity^3), data = df)
summary(m)
Call:
lm(formula = Tree_circumference ~ I(Fire_severity^3), data = df)
Residuals:
Min 1Q Median 3Q Max
-0.22587 -0.06074 0.00113 0.05736 0.22499
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99370 0.01292 386.55 <2e-16 ***
I(Fire_severity^3) -0.99621 0.03485 -28.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09696 on 98 degrees of freedom
Multiple R-squared: 0.8929, Adjusted R-squared: 0.8918
F-statistic: 817.1 on 1 and 98 DF, p-value: < 2.2e-16
Note the use of I in the model formula - this means ‘as is’. It allows us to transform the explanatory variable within the model formula. What do you notice about the coefficient estimates and R2?
Now the diagnostics.
par(mfrow = c(2,2))
plot(m)
Much better, no strong patterns in the residuals.
Sometimes, non-linear relationships cannot be adequately captured by a parametric, linear model. In this case, you might consider trying a generalised additive model. This is outside the scope of this course, but {mgcv} in R is excellent for this.
Categorical explanatory variable
Instead of a continuous measurement of fire severity, perhaps we only know whether our locations have been burned or are un-burned. Can we still use a linear model to estimate the difference in tree circumference (on average) between burned and un-burned locations?
Let’s simulate data with a categorical explanatory variable for fire (Burned vs. Un-burned), instead of a continuous measure of fire severity.
We will make all the same assumptions as before, although our known truth for \(b_{1}\) has a slightly different interpretation now that we have a categorical predictor: a \(b_{1}\) equal to -1 means that tree circumference, on average, is 1 unit lower at burned locations compared to un-burned locations.
Let’s simulate the data and plot it.
set.seed(123) # set random number generator for reproducibility
# simulation parameters
b0 <- 5 # average tree circumference when fire severity is = 0 (i.e, y-intercept)
b1 <- -1 # beta coefficient representing association (slope) between tree circumference and fire severity
e_mu <- 0 # mean error
e_sd <- 0.1 # error standard deviation
# simulation variables
n <- 100 # number of tree circumference measurements
X <- rep(c(0,1), n/2) # burned vs. un-burned
e <- rnorm(n, e_mu, e_sd) # natural error
# simulate y
y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
#plot
ggplot(df) +
aes(x = factor(Fire_severity), y = Tree_circumference) +
geom_boxplot() +
geom_jitter(alpha = 0.5) +
ylab('Tree circumference') +
xlab('Fire severity') +
geom_smooth(method = 'lm') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Note that the categories for fire severity are 0 and 1. The 0 corresponds to ‘Un-burned’ and the 1 corresponds to ‘Burned’ locations. We should (and easily can) label these as such for clarity, but I’ve purposefully left them as 0’s and 1’s here to illustrate something important about what linear models do when estimating coefficients for categorical variables.
Linear models dummy code categorical explanatory variables so that the reference level (in this case ‘Un-burned’) is equal to 0, and the non-reference level(s) (in this case ‘Burned’) are equal to 1. Let’s fit the model and look at the summary.
m <- lm(Tree_circumference ~ Fire_severity, data = df)
summary(m)
Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.237948 -0.057986 -0.000856 0.059773 0.209864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.01105 0.01297 386.31 <2e-16 ***
Fire_severity -1.00402 0.01834 -54.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09172 on 98 degrees of freedom
Multiple R-squared: 0.9683, Adjusted R-squared: 0.968
F-statistic: 2995 on 1 and 98 DF, p-value: < 2.2e-16
Great, so once again, we’ve been able to recover the known truth (an effect size of -1) with our linear model.
Note that if you have a categorical variable with more than 2 levels, each of the ‘non-reference’ levels gets its own 0/1 dummy variable. So, you end up estimating the effect of each level on the response, relative to the ‘reference’ level. Here, the \(b_{1}\) coefficient of -1 means that, on average, tree circumference is one unit lower at burned sites relative to un-burned sites.
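We can see the dummy coding R builds under the hood with model.matrix(). Here is a sketch with a hypothetical 3-level fire variable (the level names are made up):

```r
# model.matrix() shows the 0/1 dummy coding used for a factor;
# we set the levels explicitly so 'Unburned' is the reference
fire <- factor(c('Unburned', 'Low', 'High'),
               levels = c('Unburned', 'Low', 'High'))
model.matrix(~ fire)
# The reference level ('Unburned') is all 0s across the dummy columns;
# 'Low' and 'High' each get their own 0/1 column
```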
As before, we would want to check our model meets structural assumptions using diagnostic plots.
par(mfrow = c(2,2))
plot(m)
hat values (leverages) are all = 0.02
and there are no factor predictors; no plot no. 5
Looks great.
Generalised linear models
What if, in addition to tree circumference, we also wanted to ask whether species abundance is lower in sites with high fire severity? Could we just fit the same model as before, and swap out circumference for abundance?
Remember that one of the assumptions of the linear models we’ve been using so far is that \(\epsilon\), the error term, is normally distributed. Continuous data like circumference conforms to this assumption. Counts of individuals (i.e., species abundance), however, can only be whole numbers - we can’t count half a bird - and are bounded between 0 and \(∞\). Residuals from linear models fitted to count data will therefore not be normally distributed.
So, we need to generalise our linear model a bit to accommodate this non-normal error structure. Generalised linear models allow us to accommodate many different types of non-normal error structures, including errors from count data. Once again, simulation can help us to understand how generalised linear models work. So let’s try it.
Simulating data with non-normal error structure
For our previous general linear model of tree circumference, we drew errors \(\epsilon\) from a normal distribution with a mean of 0. An alternative for count data is the poisson distribution.
The poisson is a probability distribution for discrete, whole numbers like count data; it is bounded between 0 and \(∞\), and has one parameter, \(\lambda\), for the mean and variance (which are assumed to be equal). (Note, however, that ecological count data often violates the latter assumption, in which case there are alternatives like the negative binomial - we’ll look at this later).
library(patchwork) # needed to combine the two ggplots with `+` below
n <- 1000
df <- data.frame(norm = rnorm(n, 1, 1), poiss = rpois(n, 1))
a <- ggplot(df) +
aes(x = norm) +
ylab('Frequency') +
xlab('Random number') +
ggtitle('A) Normal distribution') +
geom_histogram(fill = 'cyan') +
theme_classic()
b <- ggplot(df) +
aes(x = poiss) +
ylab('Frequency') +
xlab('Random number') +
ggtitle('B) Poisson distribution') +
geom_histogram(fill = 'seagreen') +
theme_classic()
a + b
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
So, we can use the poisson distribution to simulate data that mimics counts. But how can we ensure that the predictions of counts from our linear equation \(y_i = b_0 + b_{1}X_{1i}\) will be constrained between 0 and \(∞\) instead of \(-∞\) and \(∞\)?
This is where the link function comes in. Generalised linear models use link functions to transform model predictions from a linear equation so that they are bounded as necessary. For the poisson, the canonical link function is the log, meaning all predictions from a generalised linear model with a poisson family and log link will be between 0 and \(∞\).
Mathematically, this looks like: \[log(\lambda_i) = b_0 + b_{1}X_{1i},\] where \(\lambda\) is the mean species abundance (and also the variance), and is what we need to simulate counts of abundance from a poisson distribution:
\[y_i \sim Poisson(\lambda_i).\]
The inverse of the natural log is the exponent. So, to predict mean species abundance \(\lambda\), we can take the exponent of the linear combination of the intercept and explanatory terms, like this: \[\lambda_i = exp(b_0 + b_{1}X_{1i}).\] We can then use these predicted values of \(\lambda\) to simulate counts of species abundance from a poisson distribution. Let’s simulate counts this way below.
Before we do though, one more thing to note. By taking the exponent of the linear combination of explanatory variable(s), we’re imposing a non-additive, multiplicative relationship between \(X\) and \(y\). This changes how we interpret the \(b_{1}\) coefficient. Instead of representing the additive change in \(y\) for every one unit increase in \(X\), \(b_{1}\) now represents the multiplicative change.
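A tiny numeric sketch makes this multiplicative interpretation concrete, using the same known truths we are about to simulate with (b0 = log(10), b1 = log(0.5)):

```r
# With a log link, a one unit increase in X multiplies the prediction
# by exp(b1), since exp(b0 + b1*(x + 1)) = exp(b0 + b1*x) * exp(b1)
b0 <- log(10)
b1 <- log(0.5)
exp(b0 + b1*0) # mean abundance at fire severity 0: 10
exp(b0 + b1*1) # mean abundance at fire severity 1: 10 * 0.5 = 5
```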
Here we go.
set.seed(123)
# All counts predicted by a poisson GLM will be on the log scale.
# So here we log our 'known truths' for the intercept and beta coefficients
b0 <- log(10) # average species abundance when fire severity is = 0 (i.e, intercept) is 10
b1 <- -log(1/0.5) # species abundance decreases by a factor of 2 (aka 50% decrease in abundance) for a one unit increase in fire severity
n <- 100 # number of observations
X <- runif(n, min = 0, max = 1) # fire severity measurements
lambda <- exp(b0 + b1*X) # estimate mean species abundance (lambda) on multiplicative scale using the exponent
y <- rpois(n, lambda) # simulate species abundance observation from each lambda value
df <- data.frame(Fire_severity = X, Species_abundance = y) # make a dataframe of observations
ggplot(df) +
aes(x = Fire_severity, y = Species_abundance) +
geom_point() +
ylab('Species abundance') +
xlab('Fire severity') +
geom_smooth(method = 'lm') +
#geom_smooth(method = 'loess') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Looks just as we would expect, only whole numbers for species abundance, and species abundance decreases by half for a one unit increase in fire severity (10*0.5 = 5).
Let’s see what happens when we try to fit a regular linear model (which assumes a linear relationship between \(X\) and \(y\) and a normal error structure).
m <- lm(Species_abundance ~ Fire_severity, data = df)
summary(m)
Call:
lm(formula = Species_abundance ~ Fire_severity, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.8689 -1.6252 -0.4015 1.2453 6.7348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2042 0.4986 20.464 < 2e-16 ***
Fire_severity -5.6446 0.8694 -6.493 3.49e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.465 on 98 degrees of freedom
Multiple R-squared: 0.3008, Adjusted R-squared: 0.2936
F-statistic: 42.16 on 1 and 98 DF, p-value: 3.488e-09
Right away, we can see that the slope (\(b_{1}\)) is biased from the known truth.
Now check the diagnostics.
par(mfrow = c(2,2))
plot(m)
Not perfect, but actually not too bad given we know this is a mis-specified model. Let’s try a generalised linear model with a poisson family (error distribution) and a log link function instead.
m <- glm(Species_abundance ~ Fire_severity, data = df, family = poisson(link = 'log'))
summary(m)
Call:
glm(formula = Species_abundance ~ Fire_severity, family = poisson(link = "log"),
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.36214 0.06847 34.50 < 2e-16 ***
Fire_severity -0.77409 0.13211 -5.86 4.64e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 112.957 on 99 degrees of freedom
Residual deviance: 78.035 on 98 degrees of freedom
AIC: 460.49
Number of Fisher Scoring iterations: 4
Okay, so what’s happened here? Do the coefficient estimates look worse than before?
The model has estimated the coefficients on the ‘link scale’, which in this case is the log scale. The inverse of the natural log is the exponent. Let’s back-transform and see whether we can recover our known truths.
exp(coef(m))
  (Intercept) Fire_severity
   10.6136780     0.4611211
Voila. Close to what we expected - a mean species abundance of 10 when fire severity is equal to 0 (the y intercept), and for this to be reduced by ~50% with a one unit increase in fire severity. (Tip: try upping the sample size (n) in the simulation and see if you get closer to the known truth.)
What else is unusual about this model summary? Now we have null and residual deviance instead of R2? How come? To estimate coefficients for generalised linear models, we can no longer use ordinary least squares (OLS) as we did with general linear models. Instead, we use maximum likelihood for parameter estimation, which, in a nutshell, means we choose the coefficients that maximise the likelihood of our model given the observed data.
A consequence of maximum likelihood parameter estimation is that we now measure model fit as deviance - the deviation of our fitted model from a perfect, ‘saturated’ model. So, the higher the residual deviance, the worse the model fit (a residual deviance of 0 implies a perfect model fit). Null deviance is the deviation of a worst-case, ‘intercept-only’ model from the perfect model.
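If you miss having an R2-style summary, one common (rough) analogue for GLMs is the proportion of null deviance explained by the model - though note it is not a true R2:

```r
# Proportion of null deviance explained by the fitted poisson GLM
1 - m$deviance/m$null.deviance # ~0.31 for the model above
```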
Now check the diagnostics. They look better than for the mis-specified linear model, but not by a lot. This highlights that it is important to rely on your knowledge of the data generating process when building models, not just on verification tools.
par(mfrow = c(2,2))
plot(m)
Count data and overdispersion
Often, ecological count data is over-dispersed. This means there is more unexplained variation in the data than the model we’ve fitted expects. Remember, the poisson distribution assumes that the mean (average species abundance) equals the variance. But often in ecological data we see greater variance in abundance estimates for larger values. Furthermore, we often have a larger number of 0 counts than the poisson dictates, which can also lead to overdispersion. (Note that under-dispersion can also occur, but is less common in ecology.)
How can we check for over-dispersion? First, we can divide the residual deviance (the amount of unexplained variation we have in our data, after fitting the model) by the residual degrees of freedom (the amount of left over variation our model expects). It should not be > 1.5. If it is, over-dispersion is likely a problem in our model fit.
m$deviance/m$df.residual
[1] 0.7962746
So overdispersion is not a problem here (and also we didn’t see anything indicative of overdispersion in the above diagnostic plots).
The rule of thumb is, if the ratio is >1.5, then overdispersion needs to be dealt with. We can try a negative binomial distribution instead of a Poisson to deal with overdispersion. Alternatively, sometimes overdispersion is driven largely by an excess number of 0s in our count data (typical in ecology). In this case we might try a zero-inflated Poisson or zero-inflated negative binomial.
We can also check the diagnostic plots. If we see a ‘fan shape’ in the residuals vs. fitted plot, implying errors are greater when fitted (predicted) abundance is higher, we can suspect that our data is over-dispersed for the model we are trying to fit to it. Of course, we didn’t see any fan shape here, consistent with no overdispersion.
Let’s simulate over-dispersed species abundance and see for ourselves. To simulate this data, we’ll use the negative binomial distribution instead of the Poisson. In contrast to the Poisson, the mean of a prediction is not expected to equal the variance. Instead, we can define the mean and variance separately, and use this to simulate overdispersed counts.
We’ll keep everything as before, but this time will use lambda as the mean counts to simulate from the negative binomial, setting the dispersion parameter to 0.8.
set.seed(123)
df2 <- data.frame(df, Species_abundance_overdisp = rnbinom(n, mu = lambda, size = 0.8))
ggplot(df2) +
aes(x = Fire_severity, y = Species_abundance_overdisp) +
geom_point() +
ylab('Species abundance (over-dispersed)') +
xlab('Fire severity') +
geom_smooth(method = 'lm') +
#geom_smooth(method = 'loess') +
theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Just as we expected, there is greater variability in species abundance than before. Now let’s fit a glm assuming a Poisson family (even though we know we should be using negative binomial) and check for overdispersion.
m <- glm(Species_abundance_overdisp ~ Fire_severity, data = df2, family = poisson(link = 'log'))
summary(m)
Call:
glm(formula = Species_abundance_overdisp ~ Fire_severity, family = poisson(link = "log"),
data = df2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.36227 0.06919 34.144 < 2e-16 ***
Fire_severity -0.86019 0.13509 -6.368 1.92e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 854.05 on 99 degrees of freedom
Residual deviance: 812.67 on 98 degrees of freedom
AIC: 1114.1
Number of Fisher Scoring iterations: 5
What do you notice right away about the model fit? The deviance is a lot higher than before (remember, if residual deviance = 0 it implies a perfect fit).
m$deviance/m$df.residual
[1] 8.292546
The ratio is much greater than 1.5! The data is certainly over-dispersed for the Poisson GLM (of course we expected this since we simulated the data from a negative binomial). Let’s just confirm with diagnostic plots too.
par(mfrow = c(2,2))
plot(m)
Not good. But when dealing with high-dispersion data (such as count data), we often wouldn’t expect residuals to be normally distributed anyway.
{DHARMa} offers a nice solution, simulating quantile residuals, which we would expect to be uniformly distributed if the glm is a good fit to the data, even in high-dispersion (different from over-dispersed) situations.
library(DHARMa)
This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
resid <- simulateResiduals(fittedModel = m, plot = F)
plot(resid)
The quantile residuals confirm that overdispersion is a major problem in this model. So let’s try fitting a negative binomial glm.
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:patchwork':
area
m <- glm.nb(Species_abundance_overdisp ~ Fire_severity, data = df2, link = 'log')
summary(m)
Call:
glm.nb(formula = Species_abundance_overdisp ~ Fire_severity,
data = df2, link = "log", init.theta = 0.7956894295)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3028 0.2376 9.691 <2e-16 ***
Fire_severity -0.7319 0.4183 -1.750 0.0802 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.7957) family taken to be 1)
Null deviance: 117.23 on 99 degrees of freedom
Residual deviance: 113.65 on 98 degrees of freedom
AIC: 605.6
Number of Fisher Scoring iterations: 1
Theta: 0.796
Std. Err.: 0.129
2 x log-likelihood: -599.598
Summary indicates over-dispersion is no longer a problem (ratio < 1.5).
m$deviance/m$df.residual
[1] 1.159728
Let’s check our quantile residuals to confirm this model is a good fit.
resid <- simulateResiduals(fittedModel = m, plot = F)
plot(resid)
Much better.
But notice that the coefficient estimates of both the Poisson and negative binomial glms fitted to over-dispersed data were not overly biased from our known truths.
b0
[1] 2.302585
b1
[1] -0.6931472
So why do we care? Because the standard errors of the coefficients are much smaller in the mis-specified Poisson model than in the correct negative binomial one (0.135 vs. 0.418 for Fire_severity). If we don’t deal with over-dispersion, we could wind up incorrectly inferring there is a significant effect when there isn’t one.
Resources
What we haven’t covered, and where you can learn more:
- Interactive effects (i.e., the response of \(y\) to one explanatory variable depends on another); to find out more see this guide for ecologists
- Random effects (i.e., mixed effects models) - most models in ecology should probably be mixed effects models! Random effects offer many benefits, including allowing you to account for lack of independence in your observations (e.g., site-level effects, spatial and/or temporal autocorrelation). I recommend checking out this book
- Model selection - how to choose among competing, plausible models (e.g., models that include different combinations of explanatory variables). As mentioned in the introduction, you can learn more about approaches to goal-driven model selection here
Here we have learned how to fit models in a frequentist paradigm, using either ordinary least squares or maximum likelihood. As a follow-up, we will learn how to fit models in a Bayesian paradigm to allow us to fit more complex multivariate models in Part 2.